A comparison of manual and automated neural architecture search for white matter tract segmentation

Segmentation of white matter tracts in diffusion magnetic resonance images is an important first step in many imaging studies of the brain in health and disease. As in medical image segmentation more generally, a popular approach to white matter tract segmentation is to use U-Net-based artificial neural network architectures. Despite many suggested improvements to the U-Net architecture in recent years, there has been no systematic comparison of architectural variants for white matter tract segmentation. In this paper, we evaluate multiple U-Net-based architectures specifically for this purpose. We compare the results of these networks with those achieved by our own architectural modifications, as well as with new U-Net architectures designed automatically via neural architecture search (NAS). To the best of our knowledge, this is the first study to systematically compare multiple U-Net-based architectures for white matter tract segmentation, and the first to use NAS for this task. We find that the recently proposed medical image segmentation network UNet3+ slightly outperforms the current state of the art for white matter tract segmentation, and achieves a notably better mean Dice score for segmentation of the fornix (+0.01 and +0.006 mean Dice for the left and right fornix, respectively), a tract that the current state-of-the-art model struggles to segment. UNet3+ also outperforms the current state of the art when little training data is available. Additionally, manual architecture search showed a minor segmentation improvement when an additional, deeper layer is added to the U-shape of UNet3+. However, all networks, including those designed via NAS, achieve similar results, suggesting that there may be benefit in exploring networks that deviate from the general U-Net paradigm.


SUPPLEMENTARY INFORMATION
Supplementary Table S1: Full names of white matter tracts and the corresponding abbreviations used in this paper.
Supplementary Figure S1: U-Net model architecture. Each u_n node concatenates its inputs channel-wise before applying its convolution operations. The number above each node indicates the number of filters for all convolution operations within that node and, if applicable, the number of filters used by the transpose convolution feeding into that node. All convolutions use 'same' padding. BN indicates batch normalisation, and ReLU indicates rectified linear unit activation.
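For concreteness, one such node can be sketched in PyTorch as below. This is a minimal illustration only: 2D operations and two convolutions per node are assumptions on our part, and the class and argument names are illustrative rather than taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class UNetNode(nn.Module):
    """Sketch of a u_n node: concatenate all inputs channel-wise,
    then apply 3x3 convolutions, each followed by batch
    normalisation and a ReLU."""

    def __init__(self, in_channels: int, filters: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, filters, kernel_size=3, padding="same"),
            nn.BatchNorm2d(filters),
            nn.ReLU(inplace=True),
            nn.Conv2d(filters, filters, kernel_size=3, padding="same"),
            nn.BatchNorm2d(filters),
            nn.ReLU(inplace=True),
        )

    def forward(self, *inputs: torch.Tensor) -> torch.Tensor:
        # Channel-wise concatenation of the skip connection and the
        # (transpose-convolved) feature map from the deeper level.
        return self.block(torch.cat(inputs, dim=1))
```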
Supplementary Figure S2: DS-U-Net model architecture. The number above each node indicates the number of filters for all convolution operations within that node. The numbers of filters used by transpose convolutions are indicated in underlined text. All convolutions use 'same' padding. BN indicates batch normalisation, and ReLU indicates rectified linear unit activation.
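Assuming DS denotes deep supervision, the idea is that intermediate decoder levels also produce segmentation outputs that contribute to the training loss. The sketch below is our own hedged illustration of that mechanism (1 × 1 heads plus bilinear upsampling); the figure's actual configuration, including its transpose convolutions, may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervisionHeads(nn.Module):
    """Auxiliary segmentation heads for deep supervision: each
    intermediate decoder feature map is projected to the class
    channels with a 1x1 convolution and upsampled to the output
    resolution, so a loss can be applied at every decoder level."""

    def __init__(self, decoder_channels: list[int], num_classes: int):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Conv2d(c, num_classes, kernel_size=1) for c in decoder_channels
        )

    def forward(self, decoder_features: list[torch.Tensor],
                out_size: tuple[int, int]) -> list[torch.Tensor]:
        return [
            F.interpolate(head(f), size=out_size, mode="bilinear",
                          align_corners=False)
            for head, f in zip(self.heads, decoder_features)
        ]
```

The total training loss would then typically be a (possibly weighted) sum of the per-head losses.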
Supplementary Figure S3: UNet++ model architecture. All inputs to each h_{i,j} are concatenated channel-wise into a single volume before the convolution operations are applied. The number above each node indicates the number of filters for all convolution operations within that node. All convolutions use 'same' padding. BN indicates batch normalisation, and ReLU indicates rectified linear unit activation.
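The nested-node pattern can be made concrete with a short sketch. A single convolution stands in for the node's full convolution block, and all names are our own illustrative choices.

```python
import torch
import torch.nn as nn

class NestedNode(nn.Module):
    """Sketch of a UNet++ node h_{i,j}: concatenate all same-level
    predecessors h_{i,0} ... h_{i,j-1} with the upsampled output of
    the deeper node h_{i+1,j-1}, then convolve."""

    def __init__(self, in_channels: int, filters: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, filters, kernel_size=3, padding="same"),
            nn.BatchNorm2d(filters),
            nn.ReLU(inplace=True),
        )

    def forward(self, same_level: list[torch.Tensor],
                upsampled: torch.Tensor) -> torch.Tensor:
        # All inputs are merged channel-wise into a single volume,
        # as in the figure, before the convolution is applied.
        return self.block(torch.cat(same_level + [upsampled], dim=1))
```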
Supplementary Figure S4: Attention U-Net model architecture. The attention gate performs a 1 × 1 convolution on each of its inputs, sums the resulting volumes element-wise, then applies a ReLU, a second 1 × 1 convolution, batch normalisation, and a sigmoid function. All 3 × 3 convolutions use 'same' padding. The number above each node indicates the number of filters for all convolution operations within that node. BN indicates batch normalisation, and ReLU indicates rectified linear unit activation.
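The gate's operations map almost directly to code. Below is a minimal PyTorch sketch, assuming 2D tensors, that the gating signal g and skip features x already share spatial dimensions, and that the resulting attention map rescales the skip features as in the original Attention U-Net; all names are illustrative.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Attention gate: 1x1 convolution on each input, element-wise
    sum, ReLU, a second 1x1 convolution, batch normalisation, and a
    sigmoid producing per-voxel attention coefficients."""

    def __init__(self, g_channels: int, x_channels: int, inter_channels: int):
        super().__init__()
        self.w_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)
        self.w_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.psi = nn.Sequential(
            nn.Conv2d(inter_channels, 1, kernel_size=1),
            nn.BatchNorm2d(1),
            nn.Sigmoid(),
        )

    def forward(self, g: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        alpha = self.psi(self.relu(self.w_g(g) + self.w_x(x)))
        return x * alpha  # skip features scaled by attention coefficients
```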
Supplementary Figure S5: UNet3+ architecture. All convolution operations are 3 × 3 with 'same' padding. Numbers above each node indicate the number of filters for all convolution operations within the node. All skip connections (except those between the input and d_1, between u_1 and o_1, and between o_1 and the output) scale their data via bilinear upsampling or max pooling to match the dimensions of the data at the destination node. This scaling is followed by a 3 × 3 convolution with 64 filters, which precedes the channel-wise concatenation in each u_n. BN indicates batch normalisation, and ReLU indicates rectified linear unit activation.
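A single full-scale skip branch can be sketched as follows. Square feature maps whose sizes are integer multiples of one another are assumed, so that the pooling factor is exact; the 2D setting and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleSkip(nn.Module):
    """Sketch of one UNet3+ skip branch: rescale the source feature
    map to the destination node's resolution (bilinear upsampling
    from deeper levels, max pooling from shallower ones), then apply
    a 3x3 convolution with 64 filters."""

    def __init__(self, in_channels: int, filters: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, filters, kernel_size=3,
                              padding="same")

    def forward(self, x: torch.Tensor, target_size: int) -> torch.Tensor:
        src_size = x.shape[-1]  # assumes square feature maps
        if src_size < target_size:
            x = F.interpolate(x, size=(target_size, target_size),
                              mode="bilinear", align_corners=False)
        elif src_size > target_size:
            x = F.max_pool2d(x, kernel_size=src_size // target_size)
        return self.conv(x)
```

Each u_n then concatenates the 64-channel outputs of all its incoming branches channel-wise before applying its own convolutions, per the caption above.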
Supplementary Figure S6: UNet3+ architecture for model with depth 2. All convolution operations are 3 × 3 with 'same' padding. Numbers above each node indicate the number of filters for all convolution operations within the node. All skip connections (except those between the input and d_1, between u_1 and o_1, and between o_1 and the output) scale their data via bilinear upsampling or max pooling to match the dimensions of the data at the destination node. This scaling is followed by a 3 × 3 convolution with 64 filters, which precedes the channel-wise concatenation in each u_n. BN indicates batch normalisation, and ReLU indicates rectified linear unit activation.
Supplementary Figure S7: UNet3+ architecture for model with depth 3. All convolution operations are 3 × 3 with 'same' padding. Numbers above each node indicate the number of filters for all convolution operations within the node. All skip connections (except those between the input and d_1, between u_1 and o_1, and between o_1 and the output) scale their data via bilinear upsampling or max pooling to match the dimensions of the data at the destination node. This scaling is followed by a 3 × 3 convolution with 64 filters, which precedes the channel-wise concatenation in each u_n. BN indicates batch normalisation, and ReLU indicates rectified linear unit activation.
Supplementary Figure S8: UNet3+ architecture for model with depth 4. All convolution operations are 3 × 3 with 'same' padding. Numbers above each node indicate the number of filters for all convolution operations within the node. All skip connections (except those between the input and d_1, between u_1 and o_1, and between o_1 and the output) scale their data via bilinear upsampling or max pooling to match the dimensions of the data at the destination node. This scaling is followed by a 3 × 3 convolution with 64 filters, which precedes the channel-wise concatenation in each u_n. BN indicates batch normalisation, and ReLU indicates rectified linear unit activation.
Supplementary Figure S9: UNet3+ architecture for model with depth 6. All convolution operations are 3 × 3 with 'same' padding. Numbers above each node indicate the number of filters for all convolution operations within the node. All skip connections (except those between the input and d_1, between u_1 and o_1, and between o_1 and the output) scale their data via bilinear upsampling or max pooling to match the dimensions of the data at the destination node. This scaling is followed by a 3 × 3 convolution with 64 filters, which precedes the channel-wise concatenation in each u_n. BN indicates batch normalisation, and ReLU indicates rectified linear unit activation.